This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area. In this project, I look at the characteristics of the riders to find out which type of user takes longer trips, and when.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import calendar
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
gobike_df = pd.read_csv('201902-fordgobike-tripdata.csv')
gobike_df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
gobike_df.shape
(183412, 16)
gobike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
gobike_df.isnull().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
gobike_df.nunique()
duration_sec                 4752
start_time                 183401
end_time                   183397
start_station_id              329
start_station_name            329
start_station_latitude        334
start_station_longitude       335
end_station_id                329
end_station_name              329
end_station_latitude          335
end_station_longitude         335
bike_id                      4646
user_type                       2
member_birth_year              75
member_gender                   3
bike_share_for_all_trip         2
dtype: int64
Here, I will:
- Convert the start_time and end_time features to the datetime datatype.
- Derive the hour and day from start_time.
- Derive an age column from member_birth_year.
# drop missing values
gobike_df = gobike_df.dropna()
gobike_df.isnull().sum()
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
gobike_df.shape
(174952, 16)
# convert features to datetime dtype
gobike_df['start_time']=pd.to_datetime(gobike_df['start_time'])
gobike_df['end_time']=pd.to_datetime(gobike_df['end_time'])
gobike_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64
 12  user_type                174952 non-null  object
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object
 15  bike_share_for_all_trip  174952 non-null  object
dtypes: datetime64[ns](2), float64(7), int64(2), object(5)
memory usage: 22.7+ MB
gobike_df['hour_of_day'] = gobike_df.start_time.dt.hour.astype(int)
gobike_df['day_of_week'] = gobike_df.start_time.dt.strftime('%a')
#gobike_df['month_of_year'] = pd.DatetimeIndex(gobike_df['start_time']).month
#gobike_df['month_of_year'] = gobike_df['month_of_year'].astype(int).apply(lambda x: calendar.month_abbr[x])
gobike_df['member_age'] = 2022-gobike_df['member_birth_year'].astype(int)
gobike_df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | hour_of_day | day_of_week | member_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.145 | 2019-03-01 08:01:55.975 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No | 17 | Thu | 38 |
| 2 | 61854 | 2019-02-28 12:13:13.218 | 2019-03-01 05:24:08.146 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No | 12 | Thu | 50 |
| 3 | 36490 | 2019-02-28 17:54:26.010 | 2019-03-01 04:02:36.842 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No | 17 | Thu | 33 |
| 4 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes | 23 | Thu | 48 |
| 5 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959.0 | Male | No | 23 | Thu | 63 |
gobike_df.shape
(174952, 19)
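The derived columns above can be sketched on a tiny frame. This is a minimal illustration, not the notebook's own cell: the two rides are made up, and `age_at_ride` is a hypothetical alternative that measures age in the year of the ride instead of against a fixed 2022.

```python
import pandas as pd

# Two made-up rides (illustrative values only, not rows from the dataset)
toy = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.145", "2019-02-23 08:05:00.000"],
    "member_birth_year": [1984.0, 1990.0],
})
toy["start_time"] = pd.to_datetime(toy["start_time"])

# Same derivations as in the notebook: hour and abbreviated weekday name
toy["hour_of_day"] = toy["start_time"].dt.hour
toy["day_of_week"] = toy["start_time"].dt.strftime("%a")

# Assumption: age is taken relative to the year of the ride (hypothetical column)
toy["age_at_ride"] = toy["start_time"].dt.year - toy["member_birth_year"].astype(int)

print(toy[["hour_of_day", "day_of_week", "age_at_ride"]])
```

Using the ride year keeps the ages tied to when the trips actually happened; either choice shifts every age by the same constant, so the shape of the age distribution is unaffected.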
After cleaning and feature engineering, this dataset contains 174952 observations and 19 features. It means that there were over 170000 rides taken from one station to another, with 4646 bikes used. Some of these features, such as duration_sec, are numerical, while others, such as member_gender, are categorical; start_time and end_time are date-and-time features.
Overall, the dataset is tidy and of good quality.
I am interested in:
Features that carry information about the riders (members), such as user_type, member_age and member_gender, will be of great support to my analysis. I will also be able to look for a relationship between the riders' age and trip duration.
#summary statistics of all numerical features
gobike_df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | hour_of_day | member_age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 704.002744 | 139.002126 | 37.771220 | -122.351760 | 136.604486 | 37.771414 | -122.351335 | 4482.587555 | 1984.803135 | 13.456165 | 37.196865 |
| std | 1642.204905 | 111.648819 | 0.100391 | 0.117732 | 111.335635 | 0.100295 | 0.117294 | 1659.195937 | 10.118731 | 4.734282 | 10.118731 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 | 0.000000 | 21.000000 |
| 25% | 323.000000 | 47.000000 | 37.770407 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3799.000000 | 1980.000000 | 9.000000 | 30.000000 |
| 50% | 510.000000 | 104.000000 | 37.780760 | -122.398279 | 101.000000 | 37.781010 | -122.397437 | 4960.000000 | 1987.000000 | 14.000000 | 35.000000 |
| 75% | 789.000000 | 239.000000 | 37.797320 | -122.283093 | 238.000000 | 37.797673 | -122.286533 | 5505.000000 | 1992.000000 | 17.000000 | 42.000000 |
| max | 84548.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 | 23.000000 | 144.000000 |
I noticed in the duration_sec feature that the maximum duration is 84548 seconds, while the minimum is 61 and the 75th percentile is only 789; there could be outliers in this feature.
Also, the member_age feature has a maximum of 144 years, which is not plausible. This is definitely an outlier.
print('lowest duration in minutes:')
print(gobike_df['duration_sec'].min()/60)
print('highest duration in minutes:')
print(gobike_df['duration_sec'].max()/60)
lowest duration in minutes:
1.0166666666666666
highest duration in minutes:
1409.1333333333334
age and duration_sec variables. Were there any unusual points?
Let us check the distribution of these features.
# create a list of all numerical features
num_features = ['duration_sec', 'member_age']
gobike_df[num_features].hist(figsize=(10,5));
We can see here that both are skewed to the right, especially the duration_sec feature.
Looking at the histogram, I will choose 6000 seconds (which is 100 minutes) as the highest duration in duration_sec and visualise the distribution.
Then I will also select 80 years as the highest age in member_age (an 80-year-old rider is still plausible).
Let us plot each of them, using a log scale for the duration.
binsize=3
bin_edges=np.arange(20, gobike_df.member_age.max()+binsize, binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='member_age', bins=bin_edges)
plt.xlabel('Age (years)')
plt.ylabel('Count')
plt.xlim([15,80])
plt.title('Age Distribution')
plt.show()
The member_age distribution is now easier to read. Judging from the histogram, most riders are between 30 and 40 years old, and older adults aged 60 and above do not ride a lot.
log_binsize=0.025
bin_edges=10**np.arange(0,
np.log10(gobike_df.duration_sec.max())+log_binsize, log_binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='duration_sec', bins=bin_edges)
plt.xscale('log')
plt.xticks([50, 200, 500, 1500, 3000, 6000],
[50, 200, 500, 1500, 3000, 6000])
plt.xlabel('Duration (seconds)')
plt.ylabel('Count')
plt.xlim([50,6000])
plt.title('Distribution of Trip Duration (seconds)')
plt.show()
After log scaling and limiting the duration to 6000 seconds, the distribution looks roughly normal on the log axis, which tells us that most trips are short, typically a few hundred seconds.
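The log-spaced bins used for this histogram can be inspected on their own. The sketch below rebuilds the same edges from the maximum duration reported by `describe()` and checks that neighbouring edges differ by a constant factor; the variable names `exponents` and `ratios` are illustrative.

```python
import numpy as np

log_binsize = 0.025
max_duration = 84548  # maximum duration_sec reported by describe() above

# Exponents 0, 0.025, 0.05, ... up to just past log10(max_duration)
exponents = np.arange(0, np.log10(max_duration) + log_binsize, log_binsize)
bin_edges = 10 ** exponents

# On a log axis the bins have equal width: each edge is a fixed multiple of the previous one
ratios = bin_edges[1:] / bin_edges[:-1]
print(bin_edges[:3])
print(ratios.min(), ratios.max())
```

Because every bin is about 6% wider than the one before it, the bars look equally wide once `plt.xscale('log')` is applied.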
Let us look at the proportion of these outliers in the dataset.
# flag rows with `duration_sec` above 6000 or `member_age` above 80
outliers=((gobike_df.duration_sec>6000)|(gobike_df.member_age>80))
outlier_proportion = (outliers.sum()/gobike_df.shape[0])*100
print(f'The percentage proportion of the outliers is {round(outlier_proportion,2)}%')
The percentage proportion of the outliers is 0.52%
These outliers are less than 1% of the whole dataset.
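The flag-and-count step can be sketched on a handful of made-up rows. The values below are illustrative, not taken from the dataset; the mean of a boolean mask gives the share of flagged rows, and `~` inverts the mask to keep the rest.

```python
import pandas as pd

# Five made-up rides: one with an overly long duration, one with an implausible age
toy = pd.DataFrame({
    "duration_sec": [300, 7200, 510, 950, 400],
    "member_age":   [35, 29, 144, 41, 80],
})

# True where either threshold is exceeded (an age of exactly 80 is kept)
outliers = (toy["duration_sec"] > 6000) | (toy["member_age"] > 80)

share = outliers.mean() * 100   # percentage of flagged rows
kept = toy[~outliers]           # boolean masks are inverted with ~
print(f"{share:.1f}% flagged, {len(kept)} rows kept")
```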
Based on the histograms, I will filter out the rows containing these outliers in order to gain logical insights from this dataset.
# invert the boolean mask with ~ (unary minus is not supported on boolean Series in recent pandas)
gobike_df=gobike_df[~outliers]
gobike_df.shape
(174037, 19)
gobike_df.duration_sec.describe()
count    174037.000000
mean        634.011733
std         507.612363
min          61.000000
25%         322.000000
50%         509.000000
75%         784.000000
max        5986.000000
Name: duration_sec, dtype: float64
gobike_df.member_age.describe()
count    174037.000000
mean         37.114866
std           9.864679
min          21.000000
25%          30.000000
50%          35.000000
75%          42.000000
max          80.000000
Name: member_age, dtype: float64
With clearer summary statistics, half of the riders are between 30 and 42 years old (median 35), and the median trip duration is about 509 seconds, roughly 8.5 minutes.
Moving on to the categorical variables, let us visualize the number of trips made by hour of the day and by day of the week.
# plotting hour of the day and day of the week together
fig, ax=plt.subplots(nrows=2, figsize=[10,12])
default_color=sns.color_palette()[0]
sns.countplot(data=gobike_df, x='hour_of_day', color=default_color, ax=ax[0])
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sns.countplot(data=gobike_df, x='day_of_week', color=default_color, ax=ax[1], order=order)
fig.suptitle('Trips count by hour and day', fontsize=20)
plt.show()
Observations from the hour_of_day and day_of_week:
Looking at the hour of the day, most trips are taken between the 8th and 9th hour (8AM-9AM) in the morning, and then between the 17th and 18th hour (5PM-6PM). This suggests that trips are mostly taken just before and just after office hours.
As for the days of the week, there are noticeably fewer trips on the weekend. Since most trips are taken before and after office hours, it makes sense that Monday to Friday is on the high side.
Next, let me visualize which kind of Ford GoBike users make the most trips. Also, who takes the most trips: male or female riders?
# plotting user_type, member_gender together
fig, ax=plt.subplots(nrows=3, figsize=[10,19])
default_color=sns.color_palette()[0]
sns.countplot(data=gobike_df, x='user_type', color=default_color,
order=gobike_df.user_type.value_counts().index, ax=ax[0])
sns.countplot(data=gobike_df, x='member_gender',
color=default_color,order=gobike_df.member_gender.value_counts().index,ax=ax[1])
sns.countplot(data=gobike_df, x='bike_share_for_all_trip',
color=default_color, \
order=gobike_df.bike_share_for_all_trip.value_counts().index,
ax=ax[2])
ax[0].set_xlabel('user type')
ax[1].set_xlabel('member gender')
ax[2].set_xlabel('bike share for all trip')
fig.suptitle('Trips count by gender, usertype and bike share', fontsize=20)
plt.show()
# Plot a pie chart of the user types in %
plt.figure(figsize=[8,6])
explode = (0, 0.1)
sorted_counts = gobike_df['user_type'].value_counts()
plt.pie(sorted_counts, explode=explode, labels = sorted_counts.index,
autopct='%1.1f%%',shadow=True, startangle = 90,counterclock = False)
plt.title('Subscriber vs. Customer (in %)', fontsize=14, fontweight='bold');
Observations from user_type, member_gender and bike_share_for_all_trip:
The user_type plot shows that subscribers account for most of the rides. The difference is quite large.
On the member_gender plot, male riders take more trips than female riders, and far more than riders of other gender.
On bike_share_for_all_trip, the majority of the riders do not share a bike during their trip.
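Count plots show the ranking, while the actual shares can be read off with `value_counts(normalize=True)`. A minimal sketch on a made-up series follows; the 9-to-1 split is illustrative only and is not the dataset's real proportion.

```python
import pandas as pd

# Made-up user types: nine subscribers and one customer (illustrative only)
user_type = pd.Series(["Subscriber"] * 9 + ["Customer"], name="user_type")

# normalize=True returns fractions instead of counts; scale to percentages
shares = user_type.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

The same call on `gobike_df['member_gender']` or `gobike_df['bike_share_for_all_trip']` would put a number on each of the observations above.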
# scatter plot of duration vs. member age with all the data
#plt.figure(figsize=[8,6])
#plt.scatter(data=gobike_df, x='member_age', y='duration_sec', marker='o', markersize=3, alpha=0.05, color="purple")
#plt.xlabel('Member Age')
#plt.ylabel('Duration (Sec)')
# Plot with transparency
plt.plot( 'member_age', 'duration_sec', "", data=gobike_df, linestyle='', marker='o',
markersize=1.5, alpha=0.05, color="red")
# Titles
plt.xlabel('Member Age')
plt.ylabel('Duration (Sec)')
plt.title('Relationship with Age and Duration', fontsize=18, loc='center')
plt.show()
There is no linear relationship here between the trip duration and the age of the riders.
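A correlation coefficient is one way to put a number on what the scatter plot suggests. The sketch below uses synthetic, independently generated ages and durations as a stand-in for the real columns (an assumption, so the printed values say nothing about the actual data); Spearman's coefficient is computed as Pearson's on the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: ages and durations drawn independently of each other
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "member_age": rng.integers(21, 81, size=1000),
    "duration_sec": rng.lognormal(mean=6.2, sigma=0.7, size=1000),
})

# Pearson measures linear association
pearson = toy["member_age"].corr(toy["duration_sec"])
# Spearman (rank-based, monotonic association) as Pearson on the ranks
spearman = toy["member_age"].rank().corr(toy["duration_sec"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Running the same two lines on `gobike_df['member_age']` and `gobike_df['duration_sec']` would show how close to zero the real coefficients are.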
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a = sns.boxplot(data=gobike_df, x="day_of_week", y="member_age", showfliers=False, order=order, ax=ax[0]);
b = sns.boxplot(data=gobike_df, x="day_of_week", y="duration_sec", showfliers=False, order=order, ax=ax[1]);
a.title.set_text('Age of bike riders by day')
b.title.set_text('Trip Duration(seconds) by day')
plt.show()
fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a = sns.boxplot(data=gobike_df, x="member_gender", y="member_age", showfliers=False, ax=ax[0]);
b = sns.boxplot(data=gobike_df, x="member_gender", y="duration_sec", showfliers=False, ax=ax[1]);
a.title.set_text('Age of bike riders by gender')
b.title.set_text('Trip Duration(seconds) by gender')
plt.show()
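def boxgrid(x, y, **kwargs):
    """Draw a box plot without fliers on the current PairGrid axes."""
    default_color = sns.color_palette()[0]
    # pass x and y by keyword; recent seaborn releases reject positional data arguments
    sns.boxplot(x=x, y=y, color=default_color, showfliers=False)
plt.figure(figsize=[15,15])
num_feat=['duration_sec','member_age']
cat_feat = ['user_type']
# `height` replaces the older `size` argument
g=sns.PairGrid(data=gobike_df, x_vars=cat_feat, y_vars=num_feat, height=2.5, aspect=1.5)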
g.map(boxgrid)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('User type by Age and Trip duration', fontsize=18)
plt.show();
<Figure size 1080x1080 with 0 Axes>
plt.figure(figsize=[17,10])
num_feat=['duration_sec','member_age']
cat_feat = ['bike_share_for_all_trip']
g=sns.PairGrid(data=gobike_df, x_vars=cat_feat, y_vars=num_feat, height=2.5, aspect=1.5)
g.map(boxgrid)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Bike sharing by Age and Trip duration', fontsize=18)
plt.show();
<Figure size 1224x720 with 0 Axes>
My observations from the box plots above:
fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a=sns.boxplot(data=gobike_df, x='hour_of_day', y='duration_sec',
showfliers=False, color='green', ax=ax[0])
b=sns.boxplot(data=gobike_df, x='hour_of_day', y='member_age',
showfliers=False, color='red', ax=ax[1])
a.title.set_text('Trip duration by the start hour')
b.title.set_text('Member age by start hour')
plt.show()
I observed here that:
# plotting categorical features
fig, ax=plt.subplots(nrows=2, figsize=[12,12])
sns.countplot(data=gobike_df, x='hour_of_day', hue='user_type',
ax=ax[0])
ax[0].legend(title='user type')
sns.countplot(data=gobike_df, x='day_of_week', hue='user_type',
ax=ax[1], order=order)
ax[1].legend(title='user type')
fig.suptitle('Count of trips taken by hour and day based on usertype', fontsize=20)
plt.show()
I noticed from the charts above that:
fig, ax=plt.subplots(nrows=3, figsize=[10,15])
sns.countplot(data=gobike_df, x='hour_of_day', hue='member_gender',
palette='tab10', ax=ax[0])
ax[0].legend(title='gender')
sns.countplot(data=gobike_df, x='day_of_week', hue='member_gender',
palette='tab10', ax=ax[1], order=order)
ax[1].legend(title='gender')
sns.countplot(data=gobike_df, x='user_type', hue='member_gender',
palette='tab10', ax=ax[2])
ax[2].legend(title='gender')
fig.suptitle('Count of trips taken by hour, day and usertype based on gender', fontsize=20)
plt.show()
Here, male riders have the highest number of trips compared to female riders and riders of other gender, across all hours of the day and all days of the week.
Most of the subscribers and most of the customers are male.
# including other categorical features such as bike share for alltrip
fig, ax=plt.subplots(nrows=2, figsize=[10,10])
sns.countplot(data=gobike_df, x='day_of_week', hue='bike_share_for_all_trip',
palette='Greens', ax=ax[0], order=order)
ax[0].legend(title='bike share for all trip')
sns.countplot(data=gobike_df, x='user_type', hue='bike_share_for_all_trip', ax=ax[1])
ax[1].legend(title='bike share for all trip')
fig.suptitle('Count of trips taken by day and usertype based on bike share for all trip', fontsize=15)
plt.show()
Overall, riders who do not use bike share for their entire trip have a higher number of trips across all days of the week than those who do. It can also be seen that customers do not share bikes for their entire trip, whereas only a very small proportion of subscribers have used bike share for their entire trip.
Let me visualise the locations where male, female, and other-gender riders tend to start and end their trips. I will try to discover whether there is any particular area or route where riders of a given gender prefer to ride.
# plotting the start stations on a map, colored by gender
fig = px.scatter_mapbox(gobike_df, lat='start_station_latitude', lon='start_station_longitude',
width=800, zoom=4, color='member_gender',
height=600, hover_data=['user_type'],
)
fig.update_layout(mapbox_style='open-street-map')
fig.show()
# plotting the end stations on a map, colored by gender
fig = px.scatter_mapbox(gobike_df, lat='end_station_latitude', lon='end_station_longitude',
width=800, zoom=4, color='member_gender',
height=600, hover_data=['user_type'],
)
fig.update_layout(mapbox_style='open-street-map')
fig.show()
Observation:
# Create a haversine function to calculate distance between two longitudes and latitudes
def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Returns distance, in kilometers, between one set of longitude/latitude coordinates and another"""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    newlon = lon2 - lon1
    newlat = lat2 - lat1
    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2
    dist = 2 * np.arcsin(np.sqrt(haver_formula))
    km = 6367 * dist  # 6367 gives the distance in km; use 3958 for miles
    return km
# Apply the haversine function (its signature expects longitude first, then latitude)
gobike_df['distance_covered_km'] = haversine_vectorize(gobike_df.start_station_longitude, gobike_df.start_station_latitude,
gobike_df.end_station_longitude, gobike_df.end_station_latitude)
gobike_df.head(5)
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip | hour_of_day | day_of_week | member_age | distance_covered_km |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1585 | 2019-02-28 23:54:18.549 | 2019-03-01 00:20:44.074 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes | 23 | Thu | 48 | 2.646282 |
| 5 | 1793 | 2019-02-28 23:49:58.632 | 2019-03-01 00:19:51.760 | 93.0 | 4th St at Mission Bay Blvd S | 37.770407 | -122.391198 | 323.0 | Broadway at Kearny | 37.798014 | -122.405950 | 5200 | Subscriber | 1959.0 | Male | No | 23 | Thu | 63 | 2.321460 |
| 6 | 1147 | 2019-02-28 23:55:35.104 | 2019-03-01 00:14:42.588 | 300.0 | Palm St at Willow St | 37.317298 | -121.884995 | 312.0 | San Jose Diridon Station | 37.329732 | -121.901782 | 3803 | Subscriber | 1983.0 | Female | No | 23 | Thu | 39 | 2.003216 |
| 7 | 1615 | 2019-02-28 23:41:06.766 | 2019-03-01 00:08:02.756 | 10.0 | Washington St at Kearny St | 37.795393 | -122.404770 | 127.0 | Valencia St at 21st St | 37.756708 | -122.421025 | 6329 | Subscriber | 1989.0 | Male | No | 23 | Thu | 33 | 2.927852 |
| 8 | 1570 | 2019-02-28 23:41:48.790 | 2019-03-01 00:07:59.715 | 10.0 | Washington St at Kearny St | 37.795393 | -122.404770 | 127.0 | Valencia St at 21st St | 37.756708 | -122.421025 | 6548 | Subscriber | 1988.0 | Other | No | 23 | Thu | 34 | 2.927852 |
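The haversine formula can be checked by hand on a single pair of stations. The sketch below is a scalar version written with the standard library only; `haversine_km` is an illustrative name, and the coordinates are those of Frank H Ogawa Plaza and 10th Ave at E 15th St from the table above, passed longitude first.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6367):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Frank H Ogawa Plaza -> 10th Ave at E 15th St, longitude first then latitude
d = haversine_km(-122.271738, 37.804562, -122.248780, 37.792714)
print(round(d, 2))

# A trip that starts and ends at the same station has zero distance
same = haversine_km(-122.404770, 37.795393, -122.404770, 37.795393)
print(same)
```

With the arguments in this order the two stations come out roughly 2.4 km apart. Note that this is straight-line distance between stations, not the route actually ridden.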
gobike_df[['start_station_name', 'end_station_name', 'member_gender',
'user_type','distance_covered_km']].sort_values(
by=['distance_covered_km'], ascending=False).head(5)
| | start_station_name | end_station_name | member_gender | user_type | distance_covered_km |
|---|---|---|---|---|---|
| 19827 | Foothill Blvd at Fruitvale Ave | Montgomery St BART Station (Market St at 2nd St) | Male | Subscriber | 19.806419 |
| 87602 | Broadway at Battery St | Grand Ave at Santa Clara Ave | Male | Customer | 17.095546 |
| 50859 | College Ave at Harwood Ave | Howard St at Beale St | Other | Subscriber | 16.209182 |
| 153112 | Marston Campbell Park | Valencia St at 24th St | Female | Subscriber | 15.974740 |
| 89787 | 10th St at Fallon St | San Francisco Ferry Building (Harry Bridges Pl... | Male | Subscriber | 14.580878 |
Observation:
The longest distance was covered by a male subscriber at approximately 20 Kilometers from Foothill Boulevard to Montgomery St BART Station.
# comparing the categorical features based on mean duration
fig, ax=plt.subplots(nrows=3, figsize=[10,17])
sns.barplot(data=gobike_df, x='day_of_week', y='duration_sec',
hue='user_type', palette='Blues', errwidth=0, ax=ax[0], order=order)
ax[0].set_ylabel('Avg Duration (seconds)')
ax[0].legend(loc=2, title='user type', bbox_to_anchor=(1,1))
sns.barplot(data=gobike_df, x='hour_of_day', y='duration_sec', hue='user_type',
palette='Reds', errwidth=0, ax=ax[1])
ax[1].set_ylabel('Avg Duration (seconds)')
ax[1].legend(loc=2, title='user type', bbox_to_anchor=(1,1))
sns.barplot(data=gobike_df, x='day_of_week', y='duration_sec',
hue='member_gender', palette='tab10', errwidth=0, ax=ax[2], order=order)
ax[2].set_ylabel('Avg Duration (seconds)')
ax[2].legend(loc=2, title='gender', bbox_to_anchor=(1,1))
fig.suptitle('Comparing Categorical features based on average duration', fontsize=20)
plt.show()
Customers have a higher average trip duration than subscribers across all hours of the day and all days of the week. For both customers and subscribers, the average trip duration is higher on weekends (Sat-Sun) than on weekdays (Mon-Fri).
Female riders have a higher average trip duration than male riders across all days of the week.
Riders of other gender tend to have the highest average trip duration during the weekends.
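The bar heights in these plots are group means, which can also be tabulated directly with `groupby`. A minimal sketch on made-up rides (the durations are illustrative, not real averages):

```python
import pandas as pd

# Five made-up rides (illustrative values only)
toy = pd.DataFrame({
    "user_type":    ["Customer", "Customer", "Subscriber", "Subscriber", "Subscriber"],
    "day_of_week":  ["Sat", "Mon", "Sat", "Mon", "Mon"],
    "duration_sec": [1500, 900, 700, 500, 600],
})

# Mean duration per user type and day, with days spread across the columns
mean_duration = (
    toy.groupby(["user_type", "day_of_week"])["duration_sec"]
       .mean()
       .unstack("day_of_week")
)
print(mean_duration)
```

Swapping `user_type` for `member_gender`, or `day_of_week` for `hour_of_day`, on the real frame would reproduce the numbers behind each panel.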
# compute the logarithm of duration to make multivariate plotting easier
def log_trans(x, inverse = False):
    """ quick function for computing log and power operations """
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)
gobike_df['log_duration'] = gobike_df['duration_sec'].apply(log_trans)
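def hist2dgrid(x, y, **kwargs):
    """Draw a 2D histogram of age against log10(duration) on the current facet."""
    palette = kwargs.pop('color')
    bins_x = np.arange(18, gobike_df.member_age.max()+2, 2)
    # log10(duration_sec) runs from about 1.7 to 3.8 (roughly 50 to 6000 seconds)
    bins_y = np.arange(1.7, 3.8+0.1, 0.1)
    plt.hist2d(x, y, bins=[bins_x, bins_y], cmap=palette, cmin=0.5)
    plt.yticks(log_trans(np.array([50, 200, 500, 1500, 3000, 6000])),
               [50, 200, 500, 1500, 3000, 6000])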
# sorting the day of week in a copy of the dataframe
dow = pd.DataFrame({'day_of_week': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],})
sort_dow = dow.reset_index().set_index('day_of_week')
# .copy() makes a real copy, so the helper column day_num is not added to gobike_df
df_copy = gobike_df.copy()
df_copy['day_num'] = df_copy['day_of_week'].map(sort_dow['index'])
# `height` replaces the older `size` argument
g=sns.FacetGrid(data=df_copy.sort_values("day_num"), col='day_of_week', col_wrap=3, height=3)
g.map(hist2dgrid, 'member_age', 'log_duration', color='inferno_r')
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Trip duration by Age and Day', fontsize=18)
g.set_xlabels('Age (years)')
g.set_ylabels('Duration (sec)')
plt.show();
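An ordered categorical is another way to keep the weekdays in calendar order without a helper column. This is an alternative sketch, not what the cell above does; the short `days` series is made up.

```python
import pandas as pd

order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up weekday labels in arbitrary order
days = pd.Series(["Sun", "Thu", "Mon", "Sat", "Thu"])

# An ordered categorical sorts by the position in `order`, not alphabetically
days_cat = pd.Series(pd.Categorical(days, categories=order, ordered=True))
sorted_days = days_cat.sort_values().tolist()
codes = days_cat.cat.codes.tolist()  # 0 for Mon up to 6 for Sun

print(sorted_days)
print(codes)
```

Once `day_of_week` is stored this way, sorting and grouping follow Monday-to-Sunday order by default.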
The interactions between the features supplemented each other and made sense when viewed together, so there was no big surprise. The difference in usage habits between male and female riders, and between riders with and without bike share for all trip, was not significant or obvious throughout the exploration. This could be related to the imbalanced number of female riders compared to male ones. It would be interesting to see how male and female riders use the system differently if there were more data on female riders, and the same can be said for the bike share for all trip feature.
gobike_df.to_csv('ford_gobike_cleaned_dataset.csv', index=False)
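A CSV file stores timestamps as plain text, so the datetime columns need to be parsed again when the cleaned file is loaded. The sketch below round-trips a one-row made-up frame through an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# One made-up ride with a parsed timestamp
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-28 17:32:10.145"]),
    "duration_sec": [52185],
})

# Write to an in-memory buffer (stands in for the CSV file), then rewind it
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that plain CSV text does not keep
reloaded = pd.read_csv(buffer, parse_dates=["start_time"])
print(reloaded.dtypes)
```

For the saved file, this would mean passing `parse_dates=['start_time', 'end_time']` to `pd.read_csv`.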